RFdiffusion Exercise
RFdiffusion binder design
We will start with the ‘manual’ way, running each of the steps in the RFdiffusion -> ProteinMPNN -> Alphafold2 initial guess workflow individually.
Then, we will run the nf-binder-design workflow that combines these steps into a more streamlined pipeline to better suit ‘production’ use on high-performance computing.
The target
(press the spanner icon to see the sequence, )
Here’s version of the PDL1 domain we cropped in the previous exercise: PDL1.pdb
On your server, let’s create a directory for our RFDiffusion work and upload this PDB file to input/PDL1.pdb:
# Start by creating a directory where we will work - this may be your home directory,
# but in this case each participant should create a specifc path with their username in the workshop scratch space
mkdir -p /scratch2/pd27/users/${USER}
cd /scratch2/pd27/users/${USER}
mkdir -p exercises/rfd/input
cd exercises/rfdUpload your trimmed PDL1 PDB file to /scratch2/pd27/users/${USER}/exercises/rfd/input/PDL1.pdb (using rsync, scp or your preferred method).
If you’d like to use a pre-prepared PDL1.pdb rather than your own, run:
wget -O input/PDL1.pdb https://australian-protein-design-initiative.github.io/binder-design-workshop/exercises/rfd/input/PDL1.pdbRunning RFdiffusion binder design
RFdiffusion is a general tool for hallucinating protein structures - not only de novo binder design.
Here, we are going to run RFdiffusion with parameters specific for generating a small de novo binder chains, hopefully with good shape complementarity to our target and near our hotspots.
Let’s start by running the command, and while things are running we can break down what each part does:
# Load the RFdiffusion module that is pre-installed on the M3 HPC cluster
# This puts the RFdiffusion script `run_inference.py` on your PATH
#
module load rfdiffusion/.b44206a
mkdir -p output/rfdiffusion
run_inference.py \
inference.input_pdb=input/PDL1.pdb \
'contigmap.contigs=[A18-132/0 65-120]' \
'ppi.hotspot_res=[A56]' \
inference.output_prefix=output/rfdiffusion/pdl1_test \
inference.num_designs=4 \
denoiser.noise_scale_ca=0 \
denoiser.noise_scale_frame=0Note that this version of the workshop it tailored for the M3 HPC cluster where we’ve pre-installed each tool as a
module. If you are running this on a different system, you will need to adapt these commands to use the Apptainer containers less transparently - see Appendix:Running the Apptainer containers on any system.
RFdiffusion options
inference.input_pdb: our target PDB file- this should contain (possibly cropped) target coordinatescontigmap.contigs: define the regions of the target we want to include (A18-132/0), and a length range for the new chain to generate (65-120)ppi.hotspot_res: our hotspot residuesinference.output_prefix: the prefix for the output files (can be{directory}/{filename}prefix)inference.num_designs: the number of designs (trajectories) to generate - we generate just a small number here - normally this might be 1000 or moredenoiser.noise_scale_caanddenoiser.noise_scale_frame: the noise scale for the translations and rotations - set to zero, since this is reported to improve the quality of the models as the expense of diversity (0.5might also be a reasonable value)
More on the contig syntax
The contig syntax is a way of specifying existing residues in the target to include, and new residues / chains to add by hallucination.
A18-132 says ‘include the existing chain A, residues 18-132’ - we could exclude an N-terminal region like A27-132 - this would be equivalent to deleting those ATOM records.
The /0 at the end of A18-132/0 specifies a chain break. This is important - if you exclude it, the new generated residues will be fused to the C-terminal end of your target !
If we had a second chain B in the target, we might have something like: A18-132/0 B33-148/0 65-120
If we had a missing loop in our target spanning residues 73-83, we would need: A18-72/A84-132/0 65-120 (RFdiffusion will complain with an error if you include residues in a contig that don’t exist)
If a segment does not have a chain ID specfied, like 65-120, this is treated as a new chain to hallicinate, with a lower and upper length range.
How would we generate binders to our PDL1 domain that are exactly 100 residues long ?
You can see the full list of configuration options for RFdiffusion with:
$PREFIX /app/RFdiffusion/scripts/run_inference.py --help… most should probably be left as the defaults.
Some, like inference.ckpt_override_path are automatically set for you to select the correct model weights based on other config options in use.
For binder design, manually setting inference.ckpt_override_path=/models/rfdiffusion/Complex_beta_ckpt.pt can be useful to increase the beta-strand content of designs (this model is reportedly less well tested - YMMV !).
/models/rfdiffusion/ corresponds to the path where your RFdiffusion model weights were downloaded to - /models/rfdiffusion/ is a valid path in the context of the containers we are using here but may be different for other installations of RFdiffusion.
Viewing the results
ProteinMPNN inverse folding
TODO
# Load the dl_binder_design (ProteinMPNN) module that is pre-installed on the M3 HPC cluster
# This puts the script `dl_interface_design.py` (from https://github.com/nrbennet/dl_binder_design/blob/main/mpnn_fr/dl_interface_design.py) on your PATH
#
module load dl_binder_design/.cafa385
mkdir -p output/proteinmpnn
dl_interface_design.py \
-pdbdir input/ \
-relax_cycles 0 \
-seqs_per_struct 2 \
-outpdbdir output/proteinmpnn/ \
-omit_AAs COther useful options:
-checkpoint_path-temperature-augment_eps
Alphafold2 initial guess - scoring by prediction
Once sequences have been generated for the backbone design via inverse folding, we want to predict more accurately if this binder sequence is likely to fold into the correct structure.
TODO
# Load the dl_binder_design (Alphafold2 initial guess) module that is pre-installed on the M3 HPC cluster
# This puts the script `predict.py`(from https://github.com/nrbennet/dl_binder_design/blob/main/af2_initial_guess/predict.py) on your PATH
#
module load dl_binder_design/.cafa385
mkdir -p output/af2_initial_guess/pdbs
predict.py \
-pdbdir output/proteinmpnn \
-outpdbdir output/af2_initial_guess/pdbs/ \
-recycle 3 \
-scorefilename output/af2_initial_guess/pdl1_test.scores.csnf-binder-design : putting it all together
To do a more realistic de novo binder design campaign, you’ll need to generate several 1000s of RFdiffusion trajectories, at least one (probably 2 or 3) proposed sequences for each backbone design with ProteinMPNN, and then run Alphafold2 initial guess on each of these to generate scores and putative complex structures (2 seqs * 1000 backbones = 2000).
You could write some shell script loops, get fancy with SLURM array jobs, write some Python/R/awk/Perl to bring it all together, and turn the commands above into a workflow.
Rather than cobble something together, we suggest you use nf-binder-design a Nextflow pipeline designed to do this. Nextflow is particularly well suited to running workflows like this, and automatically handles software installation (reproducibly across systems via containers), handles HPC queue submission and retries, and generally automates many manual steps as much as practical.
De novo binders against PDL1 with nf-binder-design
Here’s our example above, using the nf-binder-design RFdiffusion workflow:
# Load the nextflow module that is pre-installed on the M3 HPC cluster - any recent version from the last ~12 months should work
module load nextflow/24.04.3
nextflow run -r 0.1.4 Australian-Protein-Design-Initiative/nf-binder-design \
--input_pdb 'input/*.pdb' \
--outdir results \
--contigs "[A18-132/0 65-120]" \
--hotspot_res "A56" \
--rfd_n_designs=4 \
--rfd_batch_size=1 \
--pmpnn_seqs_per_struct=2 \
--pmpnn_relax_cycles=1 \
-profile local \
-resume
# --rfd_filters="rg<=20" \(this is actually a slightly simiplified version of one of the examples in the nf-binder-design Github repository)
You can see all the options available with:
nextflow run -r 0.1.4 Australian-Protein-Design-Initiative/nf-binder-design --helpAs a general rule, parameters are named to match the underlying tool, with an rfd_, pmpnn_ or af2fig prefix. Extra parameters can be configured via --rfd_extra_args or a nextflow.config file:
// nextflow.config
process {
withName: RFDIFFUSION {
ext.args = 'potentials.guiding_potentials=[\"type:binder_ROG,weight:7,min_dist:10\"] potentials.guide_decay="quadratic"'
}
}-r 0.1.4 above specifies the ‘release’ to run (version 0.1.4 in this case). If you omit this, you’ll get the latest development version - keep in mind some settings might change between versions.